library(topGO)
library(knitr) # to use kable()
library(limma) # to use vennDiagram()
library(ggplot2)
TAG <- params$TAG
VAR <- params$VAR
ANNOTATION <- list(genes = '../2019-07-26/genes/annotation.txt',
isoforms = '../2019-07-26/transcripts/annotation.txt')
TAGDIR <- paste0('../2020-01-08/', TAG, '/')
Introduction
This is the enrichment analysis for genes regulated by hatching. Because I quantified the association of expression with hatching in two different ways, this document has two sections. Section Variance reports the enrichment analysis performed with genes ordered by the proportion of expression variance explained by hatching. Note that some genes with low variance may still be highly associated with hatching, even if the fold change between levels of this factor is low. Section Differential expression uses an ordering of genes based on the significance of the differential expression between levels of hatching, which does depend on fold change.
Reading the data
Functional annotation is in 2019-07-26. I will also upload two lists of genes, with either proportion of variance explained by hatching or p-value of differential expression test.
tagVariance <- read.table(paste0(TAGDIR, VAR, '_variance.txt'))
tagPValue <- read.table(paste0(TAGDIR, VAR, '_pvalue.txt'))
annotation <- read.table(ANNOTATION[[TAG]], col.names = c('tagname', 'goterms'))
To initialize the topGOdata object, I will need the gene list as a named numeric vector, where the names are the genes identifiers and the numeric values, either the portion of variance explained by hatching or the p-values of the differential expression test. The structure() function adds attributes to an object.
Variance <- structure(tagVariance[,1], names = row.names(tagVariance))
PValues <- structure(tagPValue[,1], names = row.names(tagPValue))
rm(tagVariance, tagPValue)
There are 18598 genes scored with a variance portion and a differential expression p-value. It should actually be the exact same genes. The annotation data frame has more than one GO term for every tag, separated by |. I need a named list.
head(annotation)
## tagname goterms
## 1 XLOC_000002 GO:0008417|GO:0016020|GO:0006486
## 2 XLOC_000009 GO:0043130|GO:0005515|GO:0043161
## 3 XLOC_000010 GO:0005840|GO:0015935|GO:0003735|GO:0006412
## 4 XLOC_000015 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
## 5 XLOC_000021 GO:0016567|GO:0006397|GO:0061630|GO:0008270
## 6 XLOC_000036 GO:0005272|GO:0006814|GO:0016020
allgenes2GO <- strsplit(as.character(annotation$goterms), "|", fixed = TRUE)
names(allgenes2GO) <- annotation$tagname
rm(annotation)
There are 9691 genes with GO annotations. But the differential expression analysis includes many more genes. In order to include the not-annotated genes in the enrichment analysis, to see if annotated or not annotated genes are more or less often differentially expressed, I should attribute a GO term to them. According to [http://geneontology.org/docs/faq/] nowadays we express lack of annotation by annotating to the root nodes, i.e. GO:0008150 biological_process, GO:0003674 molecular_function, and GO:0005575 cellular_component.
for (tag in unique(c(names(PValues), names(Variance)))) {
if (! tag %in% names(allgenes2GO)) {
allgenes2GO <- append(allgenes2GO,
structure(list(c("GO:0008150", "GO:0003674", "GO:0005575")), names = tag))
}
}
Using differential expression p-values
Building the topGO object
Creation of a topGO dataset is documented in section 4 of topGO’s the user manual: https://bioconductor.org/packages/release/bioc/vignettes/topGO/inst/doc/topGO.pdf. I need to use the new command, and fill up the slots. The annot object must be a function that compiles “a list of GO terms such that each element in the list is a character vector containing all the gene identifiers that are mapped to the respective GO term.” There are several options, that you can check using help(annFUN.gene2GO), for example. The annFUN.gene2GO requires a gene2GO argument, which is the list of gene-to-GO terms I made before. The geneSelectionFun object is a function that selects the interesting (most significant) genes. It is required to perform Fisher’s exact test. The nodeSize is used to prune the GO hierarchy from the terms which have less than n annotated genes.
I generate three datasets, to analyse each of the three ontologies.
selection <- function(allScores) {return(allScores < 0.05)}
GOdata.BP <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'BP',
allGenes = PValues,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdata.MF <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'MF',
allGenes = PValues,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdata.CC <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'CC',
allGenes = PValues,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
Num_Genes = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), numGenes),
Num_GO_terms = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
| BP |
15162 |
1154 |
| MF |
17429 |
576 |
| CC |
12940 |
291 |
Running the tests
There are more than one way to test for enrichment. Something that took me a while to understand is that not only there are different test statistics (Fisher’s exact test, Kolmogorov-Smirnov, and others) but also different algorithms: classic, elim, weight… The algorithms are ways to deal with the dependence structure among GO terms due to topology. Some algorithms are compatible with all statistics implemented in topGO. But the weight and the parentchild algorithms are only applicable to Fisher’s exact test. I am not interested in the classic algorithm, which treats GO nodes as independent, and therefore produces an excess of significant results. I will not use the Fisher’s exact test, because it dependes on an arbitrary threshold of significance on non-adjusted p-values.
BP.elim <- runTest(GOdata.BP, algorithm = "elim", statistic = "ks")
BP.weight01 <- runTest(GOdata.BP, algorithm = "weight01", statistic = "ks")
BP.lea <- runTest(GOdata.BP, algorithm = "lea", statistic = "ks")
MF.elim <- runTest(GOdata.MF, algorithm = "elim", statistic = "ks")
MF.weight01 <- runTest(GOdata.MF, algorithm = "weight01", statistic = "ks")
MF.lea <- runTest(GOdata.MF, algorithm = "lea", statistic = "ks")
CC.elim <- runTest(GOdata.CC, algorithm = "elim", statistic = "ks")
CC.weight01 <- runTest(GOdata.CC, algorithm = "weight01", statistic = "ks")
CC.lea <- runTest(GOdata.CC, algorithm = "lea", statistic = "ks")
ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
algorithm = rep(c("elim", "weight01", "lea"), 3),
TermsTested = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) length(score(X))),
Significant = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) sum(score(X) < 0.01)))
kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
| BP |
elim |
1154 |
23 |
| BP |
weight01 |
1154 |
12 |
| BP |
lea |
1154 |
45 |
| MF |
elim |
576 |
13 |
| MF |
weight01 |
576 |
14 |
| MF |
lea |
576 |
25 |
| CC |
elim |
291 |
6 |
| CC |
weight01 |
291 |
4 |
| CC |
lea |
291 |
9 |
rm(ResultsSummary)
Results
The topGO package does not pay much attention to what terms are significant because they are overrepresented and which ones are underrepresented among the most differentially expressed genes. I think it’s worth separating them, to facilitate the biological interpretation. Note that not all terms listed in the tables below are significant. The scores for the three methods (elim, weight01 and lea) are non-corrected p-values.
Biological process
orderedTerms <- names(sort(score(BP.weight01)))
significant.weight01 <- score(BP.weight01)[orderedTerms] <= 0.01
significant.lea <- score(BP.lea)[orderedTerms] <= 0.01
significant.elim <- score(BP.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.pvalue.sigTerms <- sigTerms
BP.all <- GenTable(GOdata.BP, elim=BP.elim, weight01=BP.weight01, lea=BP.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
BP.all[BP.all$Significant > BP.all$Expected,],
caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
| 3 |
GO:0006486 |
protein glycosylation |
75 |
5 |
2.90 |
4 |
0.0018 |
0.0015 |
0.00177 |
| 4 |
GO:1901264 |
carbohydrate derivative transport |
11 |
2 |
0.43 |
212 |
0.1233 |
0.0029 |
0.12333 |
| 5 |
GO:0015931 |
nucleobase-containing compound transport |
29 |
2 |
1.12 |
498 |
0.4205 |
0.0029 |
0.42046 |
| 6 |
GO:0034314 |
Arp2/3 complex-mediated actin nucleation |
12 |
2 |
0.46 |
8 |
0.0031 |
0.0031 |
0.00315 |
| 8 |
GO:0072330 |
monocarboxylic acid biosynthetic process |
24 |
2 |
0.93 |
595 |
0.5189 |
0.0044 |
0.51887 |
| 11 |
GO:0009968 |
negative regulation of signal transducti… |
9 |
1 |
0.35 |
9 |
0.0036 |
0.0064 |
0.00365 |
| 13 |
GO:0003341 |
cilium movement |
10 |
1 |
0.39 |
30 |
0.0110 |
0.0110 |
0.01100 |
| 14 |
GO:0001522 |
pseudouridine synthesis |
13 |
1 |
0.50 |
34 |
0.0120 |
0.0120 |
0.01201 |
| 15 |
GO:0035435 |
phosphate ion transmembrane transport |
8 |
1 |
0.31 |
35 |
0.0123 |
0.0123 |
0.01230 |
| 16 |
GO:0044341 |
sodium-dependent phosphate transport |
8 |
1 |
0.31 |
36 |
0.0123 |
0.0123 |
0.01230 |
| 23 |
GO:0006030 |
chitin metabolic process |
74 |
3 |
2.86 |
224 |
0.1318 |
0.0191 |
0.13182 |
kable(
BP.all[BP.all$Significant < BP.all$Expected,],
caption = "Most under-represented terms of the Biological Process ontology.")
Most under-represented terms of the Biological Process ontology.
| 1 |
GO:0060271 |
cilium assembly |
46 |
1 |
1.78 |
1 |
8e-07 |
6.9e-06 |
8e-07 |
| 2 |
GO:0046907 |
intracellular transport |
202 |
6 |
7.81 |
439 |
0.3577 |
0.0014 |
0.35774 |
| 7 |
GO:0007264 |
small GTPase mediated signal transductio… |
81 |
2 |
3.13 |
10 |
0.0039 |
0.0032 |
0.00393 |
| 9 |
GO:0043604 |
amide biosynthetic process |
215 |
5 |
8.31 |
696 |
0.6011 |
0.0044 |
0.47885 |
| 10 |
GO:0006418 |
tRNA aminoacylation for protein translat… |
40 |
0 |
1.55 |
13 |
0.0044 |
0.0044 |
0.00445 |
| 12 |
GO:0016579 |
protein deubiquitination |
40 |
1 |
1.55 |
17 |
0.0064 |
0.0064 |
0.00642 |
| 17 |
GO:0006511 |
ubiquitin-dependent protein catabolic pr… |
102 |
2 |
3.94 |
714 |
0.6260 |
0.0135 |
0.62602 |
| 18 |
GO:0006012 |
galactose metabolic process |
6 |
0 |
0.23 |
39 |
0.0143 |
0.0143 |
0.01430 |
| 19 |
GO:0098813 |
nuclear chromosome segregation |
27 |
1 |
1.04 |
436 |
0.3523 |
0.0175 |
0.35227 |
| 20 |
GO:0035556 |
intracellular signal transduction |
173 |
6 |
6.69 |
23 |
0.0096 |
0.0182 |
0.00022 |
| 21 |
GO:1901135 |
carbohydrate derivative metabolic proces… |
300 |
11 |
11.59 |
370 |
0.2625 |
0.0190 |
0.45764 |
| 22 |
GO:0019637 |
organophosphate metabolic process |
204 |
6 |
7.88 |
511 |
0.4352 |
0.0191 |
0.68997 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term = Term(GOTERM[sigTerms]),
Definition = Definition(GOTERM[sigTerms]),
PValue=score(BP.weight01)[sigTerms]),
caption = paste('Biological process terms significantly associated with',
VAR, 'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with hatching according to all 3 algorithms
| GO:0060271 |
cilium assembly |
The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. |
0.0000069 |
| GO:0006486 |
protein glycosylation |
A protein modification process that results in the addition of a carbohydrate or carbohydrate derivative unit to a protein amino acid, e.g. the addition of glycan chains to proteins. |
0.0015156 |
| GO:0034314 |
Arp2/3 complex-mediated actin nucleation |
The actin nucleation process in which actin monomers combine to form a new branch on the side of an existing actin filament; mediated by the Arp2/3 protein complex and its interaction with other proteins. |
0.0031481 |
| GO:0007264 |
small GTPase mediated signal transduction |
Any series of molecular signals in which a small monomeric GTPase relays one or more of the signals. |
0.0032176 |
| GO:0006418 |
tRNA aminoacylation for protein translation |
The synthesis of aminoacyl tRNA by the formation of an ester bond between the 3’-hydroxyl group of the most 3’ adenosine of the tRNA and the alpha carboxylic acid group of an amino acid, to be used in ribosome-mediated polypeptide synthesis. |
0.0044477 |
| GO:0009968 |
negative regulation of signal transduction |
Any process that stops, prevents, or reduces the frequency, rate or extent of signal transduction. |
0.0063763 |
| GO:0016579 |
protein deubiquitination |
The removal of one or more ubiquitin groups from a protein. |
0.0064217 |
I think the GO graph is useful to see the relationship among the significant terms. But too large graphs are impossible to read. I don’t know how to split the graph in meaningful subgraphs.
showSigOfNodes(GOdata.BP, score(BP.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 186
## Number of Edges = 347
##
## $complete.dag
## [1] "A graph with 186 nodes."
This is just a example of the most significant GO term:
showGroupDensity(GOdata.BP, names(score(BP.weight01))[which.min(score(BP.weight01))])

Molecular function
orderedTerms <- names(sort(score(MF.weight01)))
significant.weight01 <- score(MF.weight01)[orderedTerms] <= 0.01
significant.lea <- score(MF.lea)[orderedTerms] <= 0.01
significant.elim <- score(MF.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.pvalue.sigTerms <- sigTerms
MF.all <- GenTable(GOdata.MF, elim=MF.elim, weight01=MF.weight01, lea=MF.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
MF.all[MF.all$Significant > MF.all$Expected,],
caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
| 2 |
GO:0008417 |
fucosyltransferase activity |
33 |
4 |
1.24 |
3 |
0.00076 |
0.00076 |
0.00076 |
| 6 |
GO:0016757 |
transferase activity, transferring glyco… |
194 |
10 |
7.29 |
2 |
0.00039 |
0.00259 |
1.2e-06 |
| 7 |
GO:1901505 |
carbohydrate derivative transmembrane tr… |
22 |
2 |
0.83 |
288 |
0.45210 |
0.00293 |
0.45210 |
| 10 |
GO:0004842 |
ubiquitin-protein transferase activity |
76 |
3 |
2.86 |
150 |
0.18724 |
0.00678 |
0.18724 |
| 11 |
GO:0005085 |
guanyl-nucleotide exchange factor activi… |
58 |
3 |
2.18 |
60 |
0.04614 |
0.00722 |
0.04614 |
| 12 |
GO:0008061 |
chitin binding |
77 |
5 |
2.89 |
9 |
0.00757 |
0.00757 |
0.00757 |
kable(
MF.all[MF.all$Significant < MF.all$Expected,],
caption = "Most under-represented terms of the Molecular Function ontology.")
Most under-represented terms of the Molecular Function ontology.
| 1 |
GO:0005515 |
protein binding |
2101 |
66 |
78.96 |
1 |
5e-08 |
1.6e-10 |
1.6e-09 |
| 3 |
GO:0005509 |
calcium ion binding |
361 |
11 |
13.57 |
4 |
0.00088 |
0.00088 |
0.00088 |
| 4 |
GO:0036459 |
thiol-dependent ubiquitinyl hydrolase ac… |
43 |
1 |
1.62 |
27 |
0.01870 |
0.00230 |
0.01870 |
| 5 |
GO:0004812 |
aminoacyl-tRNA ligase activity |
42 |
0 |
1.58 |
5 |
0.00240 |
0.00240 |
0.00240 |
| 8 |
GO:0008536 |
Ran GTPase binding |
14 |
0 |
0.53 |
6 |
0.00318 |
0.00318 |
0.00318 |
| 9 |
GO:0004707 |
MAP kinase activity |
6 |
0 |
0.23 |
7 |
0.00634 |
0.00634 |
0.00634 |
| 13 |
GO:0005524 |
ATP binding |
786 |
19 |
29.54 |
10 |
0.00877 |
0.00877 |
0.00877 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(MF.weight01)[sigTerms]),
caption = paste('Molecular function terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
Molecular function terms significantly associated with hatching according to all 3 algorithms
| GO:0005515 |
protein binding |
Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules). |
0.0000000 |
| GO:0008417 |
fucosyltransferase activity |
Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. |
0.0007586 |
| GO:0005509 |
calcium ion binding |
Interacting selectively and non-covalently with calcium ions (Ca2+). |
0.0008786 |
| GO:0004812 |
aminoacyl-tRNA ligase activity |
Catalysis of the formation of aminoacyl-tRNA from ATP, amino acid, and tRNA with the release of diphosphate and AMP. |
0.0024049 |
| GO:0016757 |
transferase activity, transferring glycosyl groups |
Catalysis of the transfer of a glycosyl group from one compound (donor) to another (acceptor). |
0.0025943 |
| GO:0008536 |
Ran GTPase binding |
Interacting selectively and non-covalently with Ran, a conserved Ras-like GTP-binding protein, implicated in nucleocytoplasmic transport, cell cycle progression, spindle assembly, nuclear organization and nuclear envelope (NE) assembly. |
0.0031825 |
| GO:0004707 |
MAP kinase activity |
Catalysis of the reaction: protein + ATP = protein phosphate + ADP. This reaction is the phosphorylation of proteins. Mitogen-activated protein kinase; a family of protein kinases that perform a crucial step in relaying signals from the plasma membrane to the nucleus. They are activated by a wide range of proliferation- or differentiation-inducing signals; activation is strong with agonists such as polypeptide growth factors and tumor-promoting phorbol esters, but weak (in most cell backgrounds) by stress stimuli. |
0.0063440 |
| GO:0008061 |
chitin binding |
Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. |
0.0075727 |
| GO:0005524 |
ATP binding |
Interacting selectively and non-covalently with ATP, adenosine 5’-triphosphate, a universally important coenzyme and enzyme regulator. |
0.0087728 |
| GO:0008375 |
acetylglucosaminyltransferase activity |
Catalysis of the transfer of an N-acetylglucosaminyl residue from UDP-N-acetyl-glucosamine to a sugar. |
0.0092534 |
showSigOfNodes(GOdata.MF, score(MF.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 61
## Number of Edges = 78
##
## $complete.dag
## [1] "A graph with 61 nodes."
showGroupDensity(GOdata.MF, names(score(MF.weight01))[which.min(score(MF.weight01))])

Cellular component
orderedTerms <- names(sort(score(CC.weight01)))
significant.weight01 <- score(CC.weight01)[orderedTerms] <= 0.01
significant.lea <- score(CC.lea)[orderedTerms] <= 0.01
significant.elim <- score(CC.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.pvalue.sigTerms <- sigTerms
CC.all <- GenTable(GOdata.CC, elim=CC.elim, weight01=CC.weight01, lea=CC.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
CC.all[CC.all$Significant > CC.all$Expected,],
caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
| 3 |
GO:0005815 |
microtubule organizing center |
37 |
2 |
1.43 |
11 |
0.01340 |
0.0046 |
0.01340 |
| 4 |
GO:0005885 |
Arp2/3 protein complex |
8 |
2 |
0.31 |
5 |
0.00799 |
0.0080 |
0.00799 |
kable(
CC.all[CC.all$Significant < CC.all$Expected,],
caption = "Most under-represented terms of the Cellular Component ontology.")
Most under-represented terms of the Cellular Component ontology.
| 1 |
GO:0030286 |
dynein complex |
26 |
0 |
1.00 |
1 |
0.00024 |
0.0022 |
0.00024 |
| 2 |
GO:0032991 |
protein-containing complex |
886 |
20 |
34.17 |
31 |
0.04084 |
0.0037 |
0.03349 |
| 5 |
GO:0005680 |
anaphase-promoting complex |
7 |
0 |
0.27 |
7 |
0.01124 |
0.0112 |
0.01124 |
| 6 |
GO:0044428 |
nuclear part |
236 |
4 |
9.10 |
39 |
0.05022 |
0.0122 |
0.07962 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(CC.weight01)[sigTerms]),
caption = paste('Cellular component terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with hatching according to all 3 algorithms
| GO:0030286 |
dynein complex |
Any of several large complexes that contain two or three dynein heavy chains and several light chains, and have microtubule motor activity. |
0.0021741 |
| GO:0005885 |
Arp2/3 protein complex |
A stable protein complex that contains two actin-related proteins, Arp2 and Arp3, and five novel proteins (ARPC1-5), and functions in the nucleation of branched actin filaments. |
0.0079859 |
showSigOfNodes(GOdata.CC, score(CC.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 30
## Number of Edges = 48
##
## $complete.dag
## [1] "A graph with 30 nodes."
showGroupDensity(GOdata.CC, names(score(CC.weight01))[which.min(score(CC.weight01))])

Using the portion of variance explained by hatching
Building the topGO object
I need to generate the topGO objects again, using the alternative gene ordering, based on the proportion of expression-level variance explained by hatching. I miss a way to inform the topGOdata object that the score in this case is reversed, with respect to p-values: the higher it is, the more differentially expressed the gene is. To make sure that GO terms are tested in the same way than when using p-values, I will just reverse the proportion of variance explained by hatching to its complement. Taking this into account, the subset of interesting genes (selection() function) must be defined as the lowest 10% scores, which are the 10% genes with largest portion of variance explained by hatching.
selection <- function(allScores) {return(allScores <= quantile(allScores, probs = 0.10))}
GOdataVar.BP <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'BP',
allGenes = 1.0 - Variance,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdataVar.MF <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'MF',
allGenes = 1.0 - Variance,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
GOdataVar.CC <- new('topGOdata',
description = 'Differential expression among Brachionus plicatilis populations.',
ontology = 'CC',
allGenes = 1.0 - Variance,
annot = annFUN.gene2GO,
gene2GO = allgenes2GO,
geneSelectionFun = selection,
nodeSize = 5)
library(knitr)
DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
Num_Genes = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), numGenes),
Num_GO_terms = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
| BP |
15162 |
1154 |
| MF |
17429 |
576 |
| CC |
12940 |
291 |
Running the tests
BPvar.elim <- runTest(GOdataVar.BP, algorithm = "elim", statistic = "ks")
BPvar.weight01 <- runTest(GOdataVar.BP, algorithm = "weight01", statistic = "ks")
BPvar.lea <- runTest(GOdataVar.BP, algorithm = "lea", statistic = "ks")
MFvar.elim <- runTest(GOdataVar.MF, algorithm = "elim", statistic = "ks")
MFvar.weight01 <- runTest(GOdataVar.MF, algorithm = "weight01", statistic = "ks")
MFvar.lea <- runTest(GOdataVar.MF, algorithm = "lea", statistic = "ks")
CCvar.elim <- runTest(GOdataVar.CC, algorithm = "elim", statistic = "ks")
CCvar.weight01 <- runTest(GOdataVar.CC, algorithm = "weight01", statistic = "ks")
CCvar.lea <- runTest(GOdataVar.CC, algorithm = "lea", statistic = "ks")
ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
algorithm = rep(c("elim", "weight01", "lea"), 3),
TermsTested = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) length(score(X))),
Significant = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) sum(score(X) < 0.01)))
kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
| BP |
elim |
1154 |
39 |
| BP |
weight01 |
1154 |
18 |
| BP |
lea |
1154 |
74 |
| MF |
elim |
576 |
22 |
| MF |
weight01 |
576 |
17 |
| MF |
lea |
576 |
29 |
| CC |
elim |
291 |
11 |
| CC |
weight01 |
291 |
9 |
| CC |
lea |
291 |
22 |
rm(ResultsSummary)
Results
Biological process
orderedTerms <- names(sort(score(BPvar.weight01)))
significant.weight01 <- score(BPvar.weight01)[orderedTerms] <= 0.01
significant.lea <- score(BPvar.lea)[orderedTerms] <= 0.01
significant.elim <- score(BPvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.variance.sigTerms <- sigTerms
BPvar.all <- GenTable(GOdataVar.BP, elim=BPvar.elim, weight01=BPvar.weight01, lea=BPvar.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
BPvar.all[BPvar.all$Significant > BPvar.all$Expected,],
caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
| 1 |
GO:0046907 |
intracellular transport |
202 |
30 |
19.97 |
136 |
0.03824 |
0.00018 |
0.00241 |
| 3 |
GO:0007264 |
small GTPase mediated signal transductio… |
81 |
12 |
8.01 |
3 |
0.00034 |
0.00075 |
0.00034 |
| 7 |
GO:0007020 |
microtubule nucleation |
11 |
2 |
1.09 |
10 |
0.00224 |
0.00224 |
0.00224 |
| 8 |
GO:0015931 |
nucleobase-containing compound transport |
29 |
3 |
2.87 |
211 |
0.08131 |
0.00225 |
0.08131 |
| 9 |
GO:1901264 |
carbohydrate derivative transport |
11 |
2 |
1.09 |
150 |
0.04630 |
0.00225 |
0.04630 |
| 11 |
GO:0034314 |
Arp2/3 complex-mediated actin nucleation |
12 |
4 |
1.19 |
17 |
0.00291 |
0.00291 |
0.00291 |
| 12 |
GO:0006417 |
regulation of translation |
16 |
3 |
1.58 |
47 |
0.01119 |
0.00423 |
0.01119 |
| 14 |
GO:0048193 |
Golgi vesicle transport |
43 |
6 |
4.25 |
428 |
0.25589 |
0.00506 |
0.36524 |
| 16 |
GO:0032006 |
regulation of TOR signaling |
7 |
3 |
0.69 |
28 |
0.00672 |
0.00672 |
0.00672 |
| 17 |
GO:0006302 |
double-strand break repair |
21 |
4 |
2.08 |
8 |
0.00189 |
0.00724 |
0.00189 |
| 18 |
GO:1902533 |
positive regulation of intracellular sig… |
6 |
3 |
0.59 |
35 |
0.00967 |
0.00967 |
0.00967 |
| 20 |
GO:0006012 |
galactose metabolic process |
6 |
2 |
0.59 |
46 |
0.01093 |
0.01093 |
0.01093 |
| 22 |
GO:0046854 |
phosphatidylinositol phosphorylation |
11 |
2 |
1.09 |
62 |
0.01364 |
0.01364 |
0.01364 |
| 23 |
GO:0006672 |
ceramide metabolic process |
8 |
3 |
0.79 |
336 |
0.16508 |
0.01517 |
0.16508 |
| 25 |
GO:0006357 |
regulation of transcription by RNA polym… |
58 |
7 |
5.73 |
119 |
0.03000 |
0.01610 |
0.03000 |
| 26 |
GO:0006606 |
protein import into nucleus |
9 |
2 |
0.89 |
84 |
0.01887 |
0.01887 |
0.01887 |
| 27 |
GO:0006325 |
chromatin organization |
60 |
13 |
5.93 |
49 |
0.01146 |
0.01903 |
0.00282 |
| 28 |
GO:0007275 |
multicellular organism development |
41 |
5 |
4.05 |
335 |
0.16442 |
0.01931 |
0.16442 |
| 29 |
GO:0051650 |
establishment of vesicle localization |
6 |
2 |
0.59 |
90 |
0.01972 |
0.01972 |
0.01972 |
| 30 |
GO:0006506 |
GPI anchor biosynthetic process |
24 |
5 |
2.37 |
92 |
0.02027 |
0.02027 |
0.02027 |
| 31 |
GO:0009967 |
positive regulation of signal transducti… |
9 |
5 |
0.89 |
99 |
0.02283 |
0.02283 |
0.00026 |
| 32 |
GO:0016573 |
histone acetylation |
12 |
5 |
1.19 |
101 |
0.02325 |
0.02325 |
0.02325 |
| 34 |
GO:0031323 |
regulation of cellular metabolic process |
423 |
53 |
41.82 |
163 |
0.05053 |
0.02474 |
0.00081 |
| 35 |
GO:0009968 |
negative regulation of signal transducti… |
9 |
3 |
0.89 |
18 |
0.00300 |
0.02493 |
0.00300 |
| 36 |
GO:0050953 |
sensory perception of light stimulus |
8 |
2 |
0.79 |
666 |
0.49177 |
0.02495 |
0.49177 |
| 37 |
GO:0007156 |
homophilic cell adhesion via plasma memb… |
21 |
3 |
2.08 |
106 |
0.02580 |
0.02580 |
0.02580 |
| 38 |
GO:0009894 |
regulation of catabolic process |
5 |
2 |
0.49 |
112 |
0.02793 |
0.02793 |
0.02793 |
kable(
BPvar.all[BPvar.all$Significant < BPvar.all$Expected,],
caption = "Most under-represented terms of the Biological Process ontology.")
Most under-represented terms of the Biological Process ontology.
| 2 |
GO:0060271 |
cilium assembly |
46 |
2 |
4.55 |
1 |
7e-05 |
0.00063 |
7e-05 |
| 4 |
GO:0006486 |
protein glycosylation |
75 |
7 |
7.41 |
19 |
0.00314 |
0.00092 |
0.00314 |
| 5 |
GO:0006511 |
ubiquitin-dependent protein catabolic pr… |
102 |
7 |
10.08 |
456 |
0.29323 |
0.00113 |
0.29323 |
| 6 |
GO:0006030 |
chitin metabolic process |
74 |
4 |
7.32 |
55 |
0.01247 |
0.00163 |
0.01247 |
| 10 |
GO:0006413 |
translational initiation |
18 |
0 |
1.78 |
12 |
0.00268 |
0.00268 |
0.00268 |
| 13 |
GO:0040029 |
regulation of gene expression, epigeneti… |
6 |
0 |
0.59 |
23 |
0.00481 |
0.00481 |
0.00481 |
| 15 |
GO:0016310 |
phosphorylation |
319 |
21 |
31.54 |
557 |
0.38238 |
0.00570 |
0.63116 |
| 19 |
GO:0016579 |
protein deubiquitination |
40 |
3 |
3.95 |
44 |
0.01061 |
0.01061 |
0.01061 |
| 21 |
GO:0006418 |
tRNA aminoacylation for protein translat… |
40 |
1 |
3.95 |
57 |
0.01302 |
0.01302 |
0.01302 |
| 24 |
GO:0044262 |
cellular carbohydrate metabolic process |
28 |
1 |
2.77 |
875 |
0.70059 |
0.01608 |
0.70059 |
| 33 |
GO:0033365 |
protein localization to organelle |
38 |
3 |
3.76 |
30 |
0.00726 |
0.02412 |
0.00726 |
| 39 |
GO:0006457 |
protein folding |
49 |
3 |
4.84 |
113 |
0.02800 |
0.02800 |
0.02800 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(BPvar.weight01)[sigTerms]),
caption = paste('Biological process terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with hatching according to all 3 algorithms
| GO:0060271 |
cilium assembly |
The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. |
0.0006298 |
| GO:0007264 |
small GTPase mediated signal transduction |
Any series of molecular signals in which a small monomeric GTPase relays one or more of the signals. |
0.0007542 |
| GO:0006486 |
protein glycosylation |
A protein modification process that results in the addition of a carbohydrate or carbohydrate derivative unit to a protein amino acid, e.g. the addition of glycan chains to proteins. |
0.0009226 |
| GO:0007020 |
microtubule nucleation |
The process in which tubulin alpha-beta heterodimers begin aggregation to form an oligomeric tubulin structure (a microtubule seed). Microtubule nucleation is the initiating step in the formation of a microtubule in the absence of any existing microtubules (‘de novo’ microtubule formation). |
0.0022416 |
| GO:0006413 |
translational initiation |
The process preceding formation of the peptide bond between the first two amino acids of a protein. This includes the formation of a complex of the ribosome, mRNA or circRNA, and an initiation complex that contains the first aminoacyl-tRNA. |
0.0026807 |
| GO:0034314 |
Arp2/3 complex-mediated actin nucleation |
The actin nucleation process in which actin monomers combine to form a new branch on the side of an existing actin filament; mediated by the Arp2/3 protein complex and its interaction with other proteins. |
0.0029146 |
| GO:0040029 |
regulation of gene expression, epigenetic |
Any process that modulates the frequency, rate or extent of gene expression; the process is mitotically or meiotically heritable, or is stably self-propagated in the cytoplasm of a resting cell, and does not entail a change in DNA sequence. |
0.0048076 |
| GO:0032006 |
regulation of TOR signaling |
Any process that modulates the frequency, rate or extent of TOR signaling. |
0.0067152 |
| GO:0006302 |
double-strand break repair |
The repair of double-strand breaks in DNA via homologous and nonhomologous mechanisms to reform a continuous DNA helix. |
0.0072352 |
| GO:1902533 |
positive regulation of intracellular signal transduction |
NA |
0.0096696 |
showSigOfNodes(GOdataVar.BP, score(BPvar.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 290
## Number of Edges = 563
##
## $complete.dag
## [1] "A graph with 290 nodes."
Below I plot variance portion, but for the termo found most significant when using p-values, for comparison.
showGroupDensity(GOdataVar.BP, names(score(BP.weight01))[which.min(score(BP.weight01))])

Molecular function
orderedTerms <- names(sort(score(MFvar.weight01)))
significant.weight01 <- score(MFvar.weight01)[orderedTerms] <= 0.01
significant.lea <- score(MFvar.lea)[orderedTerms] <= 0.01
significant.elim <- score(MFvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.variance.sigTerms <- sigTerms
MFvar.all <- GenTable(GOdataVar.MF, elim=MFvar.elim, weight01=MFvar.weight01, lea=MFvar.lea,
orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
MFvar.all[MFvar.all$Significant > MFvar.all$Expected,],
caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
| 1 |
GO:0005515 |
protein binding |
2101 |
235 |
207.34 |
1 |
3.9e-07 |
1.9e-08 |
4.4e-09 |
| 2 |
GO:0008536 |
Ran GTPase binding |
14 |
3 |
1.38 |
2 |
4.7e-05 |
4.7e-05 |
4.7e-05 |
| 3 |
GO:0043015 |
gamma-tubulin binding |
9 |
3 |
0.89 |
3 |
0.00012 |
0.00012 |
0.00012 |
| 7 |
GO:0008417 |
fucosyltransferase activity |
33 |
4 |
3.26 |
8 |
0.00114 |
0.00114 |
0.00114 |
| 9 |
GO:0003714 |
transcription corepressor activity |
10 |
5 |
0.99 |
12 |
0.00369 |
0.00369 |
0.00369 |
| 11 |
GO:0016757 |
transferase activity, transferring glyco… |
194 |
21 |
19.15 |
66 |
0.03258 |
0.00403 |
1.1e-05 |
| 12 |
GO:0060090 |
molecular adaptor activity |
10 |
1 |
0.99 |
122 |
0.12197 |
0.00527 |
0.12197 |
| 14 |
GO:0004386 |
helicase activity |
46 |
8 |
4.54 |
44 |
0.01886 |
0.00713 |
0.01886 |
| 15 |
GO:0005096 |
GTPase activator activity |
34 |
4 |
3.36 |
16 |
0.00735 |
0.00735 |
0.00735 |
| 16 |
GO:0004312 |
fatty acid synthase activity |
5 |
3 |
0.49 |
19 |
0.00912 |
0.00912 |
0.00912 |
| 17 |
GO:0005102 |
signaling receptor binding |
30 |
4 |
2.96 |
112 |
0.09745 |
0.00973 |
0.09745 |
| 18 |
GO:0004842 |
ubiquitin-protein transferase activity |
76 |
13 |
7.50 |
65 |
0.03179 |
0.01003 |
0.03179 |
| 20 |
GO:0008194 |
UDP-glycosyltransferase activity |
35 |
5 |
3.45 |
10 |
0.00203 |
0.01153 |
0.00203 |
| 21 |
GO:0003993 |
acid phosphatase activity |
8 |
1 |
0.79 |
26 |
0.01198 |
0.01198 |
0.01198 |
kable(
MFvar.all[MFvar.all$Significant < MFvar.all$Expected,],
caption = "Most under-represented terms of the Molecular Function ontology.")
Most under-represented terms of the Molecular Function ontology.
| 4 |
GO:0003743 |
translation initiation factor activity |
28 |
1 |
2.76 |
4 |
0.00015 |
0.00015 |
0.00015 |
| 5 |
GO:0008061 |
chitin binding |
77 |
7 |
7.60 |
5 |
0.00026 |
0.00026 |
0.00026 |
| 6 |
GO:0042578 |
phosphoric ester hydrolase activity |
140 |
9 |
13.82 |
495 |
0.86923 |
0.00082 |
0.86923 |
| 8 |
GO:1901505 |
carbohydrate derivative transmembrane tr… |
22 |
2 |
2.17 |
264 |
0.37713 |
0.00221 |
0.37713 |
| 10 |
GO:0036459 |
thiol-dependent ubiquitinyl hydrolase ac… |
43 |
3 |
4.24 |
51 |
0.02297 |
0.00395 |
0.02297 |
| 13 |
GO:0005524 |
ATP binding |
786 |
66 |
77.57 |
15 |
0.00571 |
0.00571 |
0.00571 |
| 19 |
GO:0015932 |
nucleobase-containing compound transmemb… |
23 |
2 |
2.27 |
288 |
0.44074 |
0.01062 |
0.44074 |
| 22 |
GO:0004707 |
MAP kinase activity |
6 |
0 |
0.59 |
27 |
0.01238 |
0.01238 |
0.01238 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(MFvar.weight01)[sigTerms]),
caption = paste('Molecular functions terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
Molecular functions terms significantly associated with hatching according to all 3 algorithms
| GO:0005515 |
protein binding |
Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules). |
0.0000000 |
| GO:0008536 |
Ran GTPase binding |
Interacting selectively and non-covalently with Ran, a conserved Ras-like GTP-binding protein, implicated in nucleocytoplasmic transport, cell cycle progression, spindle assembly, nuclear organization and nuclear envelope (NE) assembly. |
0.0000471 |
| GO:0043015 |
gamma-tubulin binding |
Interacting selectively and non-covalently with the microtubule constituent protein gamma-tubulin. |
0.0001188 |
| GO:0003743 |
translation initiation factor activity |
Functions in the initiation of ribosome-mediated translation of mRNA into a polypeptide. |
0.0001548 |
| GO:0008061 |
chitin binding |
Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. |
0.0002588 |
| GO:0008417 |
fucosyltransferase activity |
Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. |
0.0011400 |
| GO:0003714 |
transcription corepressor activity |
A protein or a member of a complex that interacts specifically and non-covalently with a DNA-bound DNA-binding transcription factor to repress the transcription of specific genes. Corepressors often act by altering chromatin structure and modifications. For example, one class of transcription coregulators modifies chromatin structure through covalent modification of histones. A second ATP-dependent class modifies the conformation of chromatin. A third class occludes DNA-binding transcription factor protein-protein interaction domains. A third class of corepressors prevents interactions of DNA bound DNA-binding transcription factor with coactivators. |
0.0036872 |
| GO:0005524 |
ATP binding |
Interacting selectively and non-covalently with ATP, adenosine 5’-triphosphate, a universally important coenzyme and enzyme regulator. |
0.0057069 |
| GO:0005096 |
GTPase activator activity |
Binds to and increases the activity of a GTPase, an enzyme that catalyzes the hydrolysis of GTP. |
0.0073531 |
| GO:0004312 |
fatty acid synthase activity |
Catalysis of the reaction: acetyl-CoA + n malonyl-CoA + 2n NADPH + 2n H+ = long-chain fatty acid + n+1 CoA + n CO2 + 2n NADP+. |
0.0091154 |
showSigOfNodes(GOdataVar.MF, score(MFvar.weight01),
firstSigNodes = sum(significant.elim),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 81
## Number of Edges = 97
##
## $complete.dag
## [1] "A graph with 81 nodes."
I plot variance portion, but for the term found most significant when using p-values, for comparison.
showGroupDensity(GOdataVar.MF, names(score(MF.weight01))[which.min(score(MF.weight01))])

Cellular component
orderedTerms <- names(sort(score(CCvar.weight01)))
significant.weight01 <- score(CCvar.weight01)[orderedTerms] <= 0.01
significant.lea <- score(CCvar.lea)[orderedTerms] <= 0.01
significant.elim <- score(CCvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
if (length(sigTerms) == 0) {
sigTerms <- orderedTerms[significant.lea & significant.elim]
}
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.variance.sigTerms <- sigTerms
CCvar.all <- GenTable(GOdataVar.CC, elim=CCvar.elim, weight01=CCvar.weight01, lea=CCvar.lea,
orderBy="weight01", ranksOf="elim", topNodes=max(sum(significant.elim), 10))
kable(
CCvar.all[CCvar.all$Significant > CCvar.all$Expected,],
caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
| 3 |
GO:0000790 |
nuclear chromatin |
24 |
3 |
2.34 |
41 |
0.03543 |
0.00231 |
0.03543 |
| 4 |
GO:0044428 |
nuclear part |
236 |
34 |
23.02 |
7 |
0.00424 |
0.00354 |
3.4e-05 |
| 6 |
GO:0005680 |
anaphase-promoting complex |
7 |
4 |
0.68 |
9 |
0.00648 |
0.00648 |
0.00648 |
| 8 |
GO:0030015 |
CCR4-NOT core complex |
8 |
4 |
0.78 |
11 |
0.00806 |
0.00806 |
0.00806 |
| 9 |
GO:1902562 |
H4 histone acetyltransferase complex |
10 |
2 |
0.98 |
35 |
0.02728 |
0.00821 |
0.02728 |
| 11 |
GO:0044451 |
nucleoplasm part |
96 |
14 |
9.36 |
6 |
0.00349 |
0.01954 |
0.00349 |
kable(
CCvar.all[CCvar.all$Significant < CCvar.all$Expected,],
caption = "Most under-represented terms of the Cellular Component ontology.")
Most under-represented terms of the Cellular Component ontology.
| 1 |
GO:0005852 |
eukaryotic translation initiation factor… |
13 |
0 |
1.27 |
1 |
0.00031 |
0.00031 |
0.00031 |
| 2 |
GO:0032991 |
protein-containing complex |
886 |
84 |
86.41 |
12 |
0.01080 |
0.00229 |
8.5e-05 |
| 5 |
GO:0016272 |
prefoldin complex |
7 |
0 |
0.68 |
8 |
0.00484 |
0.00484 |
0.00484 |
| 7 |
GO:0005783 |
endoplasmic reticulum |
68 |
4 |
6.63 |
128 |
0.29861 |
0.00773 |
0.29861 |
| 10 |
GO:0005737 |
cytoplasm |
634 |
53 |
61.83 |
42 |
0.03675 |
0.01317 |
0.02464 |
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
lea=significant.lea,
elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
Definition=Definition(GOTERM[sigTerms]),
PValue=score(CCvar.weight01)[sigTerms]),
caption = paste('Cellular component terms significantly associated with', VAR,
'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with hatching according to all 3 algorithms
| GO:0005852 |
eukaryotic translation initiation factor 3 complex |
A complex of several polypeptides that plays at least two important roles in protein synthesis: First, eIF3 binds to the 40S ribosome and facilitates loading of the Met-tRNA/eIF2.GTP ternary complex to form the 43S preinitiation complex. Subsequently, eIF3 apparently assists eIF4 in recruiting mRNAs to the 43S complex. The eIF3 complex contains five conserved core subunits, and may contain several additional proteins; the non-core subunits are thought to mediate association of the complex with specific sets of mRNAs. |
0.0003090 |
| GO:0044428 |
nuclear part |
Any constituent part of the nucleus, a membrane-bounded organelle of eukaryotic cells in which chromosomes are housed and replicated. |
0.0035361 |
| GO:0016272 |
prefoldin complex |
A multisubunit chaperone that is capable of delivering unfolded proteins to cytosolic chaperonin, which it acts as a cofactor for. In humans, the complex is a heterohexamer of two PFD-alpha and four PFD-beta type subunits. In Saccharomyces cerevisiae, it also acts in the nucleus to regulate the rate of elongation by RNA polymerase II via a direct effect on histone dynamics. |
0.0048382 |
| GO:0005680 |
anaphase-promoting complex |
A ubiquitin ligase complex that degrades mitotic cyclins and anaphase inhibitory protein, thereby triggering sister chromatid separation and exit from mitosis. Substrate recognition by APC occurs through degradation signals, the most common of which is termed the Dbox degradation motif, originally discovered in cyclin B. |
0.0064777 |
| GO:0030015 |
CCR4-NOT core complex |
The core of the CCR4-NOT complex. In Saccharomyces the CCR4-NOT core complex comprises Ccr4p, Caf1p, Caf40p, Caf130p, Not1p, Not2p, Not3p, Not4p, and Not5p. |
0.0080568 |
showSigOfNodes(GOdataVar.CC, score(CCvar.weight01),
firstSigNodes = sum(significant.weight01),
wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 46
## Number of Edges = 78
##
## $complete.dag
## [1] "A graph with 46 nodes."
For comparison, I plot the distribution of variance portion for the CC term found most significant when using p-values.
showGroupDensity(GOdataVar.CC, names(score(CC.weight01))[which.min(score(CC.weight01))])

Comparison between the two ordering of genes.
Biological process
allTerms <- usedGO(GOdata.BP)
vennDiagram(vennCounts(cbind(PValue = allTerms %in% BP.pvalue.sigTerms,
Variance = allTerms %in% BP.variance.sigTerms)))

ggplot(data = data.frame(PValue = rank(score(BP.weight01))[allTerms],
Variance = rank(score(BPvar.weight01))[allTerms]),
mapping = aes(x = PValue, y = Variance)) +
geom_point() + geom_smooth() + xlab('Using gene p-values') +
ylab('Using portion of explained variance') +
ggtitle('Ordering of BP terms by significance')

Molecular function
allTerms <- usedGO(GOdata.MF)
vennDiagram(vennCounts(cbind(PValue = allTerms %in% MF.pvalue.sigTerms,
Variance = allTerms %in% MF.variance.sigTerms)))

ggplot(data = data.frame(PValue = rank(score(MF.weight01))[allTerms],
Variance = rank(score(MFvar.weight01))[allTerms]),
mapping = aes(x = PValue, y = Variance)) +
geom_point() + geom_smooth() + xlab('Using gene p-values') +
ylab('Using portion of explained variance') +
ggtitle('Ordering of MF terms by significance')

Cellular component
allTerms <- usedGO(GOdata.CC)
vennDiagram(vennCounts(cbind(PValue = allTerms %in% CC.pvalue.sigTerms,
Variance = allTerms %in% CC.variance.sigTerms)))

ggplot(data = data.frame(PValue = rank(score(CC.weight01))[allTerms],
Variance = rank(score(CCvar.weight01))[allTerms]),
mapping = aes(x = PValue, y = Variance)) +
geom_point() + geom_smooth() + xlab('Using gene p-values') +
ylab('Using portion of explained variance') +
ggtitle('Ordering of CC terms by significance')

Session info
I save the main variables to be able to load them in a new R session later.
save(allgenes2GO,
GOdata.BP, BP.elim, BP.weight01, BP.lea, BP.pvalue.sigTerms,
GOdata.MF, MF.elim, MF.weight01, MF.lea, MF.pvalue.sigTerms,
GOdata.CC, CC.elim, CC.weight01, CC.lea, CC.pvalue.sigTerms,
GOdataVar.BP, BPvar.elim, BPvar.weight01, BPvar.lea, BP.variance.sigTerms,
GOdataVar.MF, MFvar.elim, MFvar.weight01, MFvar.lea, MF.variance.sigTerms,
GOdataVar.CC, CCvar.elim, CCvar.weight01, CCvar.lea, CC.variance.sigTerms,
file = paste('Enrichment', TAG, VAR, 'RData', sep='.'))
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=ca_ES.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=ca_ES.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=ca_ES.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] Rgraphviz_2.30.0 ggplot2_3.2.1 limma_3.42.0
## [4] knitr_1.27 topGO_2.38.1 SparseM_1.78
## [7] GO.db_3.10.0 AnnotationDbi_1.48.0 IRanges_2.20.2
## [10] S4Vectors_0.24.3 Biobase_2.46.0 graph_1.64.0
## [13] BiocGenerics_0.32.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.12 purrr_0.3.3 splines_3.6.2
## [5] lattice_0.20-38 colorspace_1.4-1 vctrs_0.2.1 htmltools_0.4.0
## [9] yaml_2.2.0 mgcv_1.8-31 blob_1.2.1 rlang_0.4.2
## [13] pillar_1.4.3 glue_1.3.1 withr_2.1.2 DBI_1.1.0
## [17] bit64_0.9-7 matrixStats_0.55.0 lifecycle_0.1.0 stringr_1.4.0
## [21] munsell_0.5.0 gtable_0.3.0 memoise_1.1.0 evaluate_0.14
## [25] labeling_0.3 highr_0.8 Rcpp_1.0.3 backports_1.1.5
## [29] scales_1.1.0 farver_2.0.3 bit_1.1-15.1 digest_0.6.23
## [33] stringi_1.4.5 dplyr_0.8.3 tools_3.6.2 magrittr_1.5
## [37] lazyeval_0.2.2 tibble_2.1.3 RSQLite_2.2.0 crayon_1.3.4
## [41] pkgconfig_2.0.3 zeallot_0.1.0 Matrix_1.2-18 assertthat_0.2.1
## [45] rmarkdown_2.1 R6_2.4.1 nlme_3.1-143 compiler_3.6.2